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Abstract: 

Malaria control strategies aiming at reducing disease transmission inten- 
sity may impact both oocyst intensity and infection prevalence in the mosquito 
vector. Thus far, mathematical models failed to identify a clear relationship 
between Plasmodium gametocytes and their infectiousness to mosquitoes. Nat- 
ural isolates of gametocytes are genetically diverse and biologically complex. 
Infectiousness to mosquitoes relies on multiple parameters such as density, sex- 
ratio, maturity, parasite genotypes and host immune factors. In this article, 
we investigated how density and genetic diversity of gametocytes impact on 
the success of transmission through the mosquito vector. We analyzed data for 
which the number of variables plus attendant interactions is at least of order 
of the sample size, precluding usage of classical models such as general linear 
models. We then applied a variable selection procedure based on the random 
forests score of variable importance. The selected variables were assessed in the 
zero inflated negative binomial model which accommodates both over-dispersion 
and the sources of non infected mosquitoes. We found that the most important 
variables related to infection prevalence and parasite intensity are gametocyte 
density and multiplicity of infection. 
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Capacite d'infection des gametocytes aux 
moustiques : selection de variables basee sur les 
forets aleatoires, et modeles modifies en zero 

Resume : 

De nouvelles strategies de reduction de la transmission du paludisme necessite 
la comprehension des facteurs pouvant influencer l'intensite d'oocystes et la 
prevalence d'infection chez le moustique vecteur. Jusqu'a maintenant, les modeles 
mathcmatiques ne sont pas parvenus a identifier une relation claire entre les 
gametocytes de Plasmodium et leur capacite a infecter les moustiques. La ca- 
pacite vectorielle du moustique peut dependre de multiple facteurs tels que la 
densitc, le sexe-ratio et la maturitc du parasite, ainsi que des facteurs immu- 
nitaircs du moustique. Dans ce papier, nous evaluons l'influence de la den- 
site et de la diversite genetique du parasite sur le succes de sa transmission 
a travers le moustique vecteur. Nous disposons de donnees decrites par di- 
verses variables dont le nombre est de l'ordre de la taille de l'echantillon, ce 
qui constitue un obstacle a l'usage de modeles classiques de regression tel que 
le modele lineaire generalise. Nous considerons alors l'importance des variables 
des forets aleatoires pour selcctionncr les variables les plus influentes. Les vari- 
ables selectionnees sont ensuite evaluees par le modele binomial negatif modifie 
en zero, qui permet de tenir compte a la fois de la sur-dispersion et des sources 
possibles des moustiques non-infectes. Nous trouvons que les variables les plus 
importantes reliees a la prevalence d'infection et l'intensite parasitaire sont la 
densite de gametocytes et la multiplicite de l'infection. 

Mots-cles : Plasmodium, moustiques, selection de variables, forets 

ALEATOIRES, MODELES MODIFIES EN ZERO. 
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1 Introduction 



Malaria still represents a major health problem in more than one hundred trop- 
ical countries. The disease is caused by the parasite Plasmodium and its trans- 
mission occurs through the bite of an infective Anopheles female mosquito. In 
the last decades, insecticide and drug resistance has seriously hampered its 
control and alternative measures are urgently needed. Because Plasmodium 
transmission relies on the success of its development within the mosquito vec- 
tor, called the sporogonic development, new strategies to fight malaria aim at 
controlling Plasmodium during the mosquito life cycle. Within the mosquito 
vector, malaria parasites undergo several life-stages and their successful devel- 
opment from one transition stage to an other will determine the outcome of 
infection. When ingested with the blood meal, male and female gametocytes 
fuse to form a zygote that differentiates into a mobile ookinete. The ookinete 
then traverses the midgut epithelium and encysts as an oocyst along the basal 
lamina. The oocyst, after several days of maturation, will release large number 
of sporozoites into the hemocoel. Sporozoites that will reach salivary glands will 
then be transmitted to a new host at a subsequent blood meal. Plasmodium 
parasites encounter severe losses during these successive phases and factors con- 
trolling parasite densities are not yet completely understood. Blood digestion 
processes and mosquito immune responses account for parasite dec rease, but 
also t he complex interplay between v ector and parasite genotypes (jVaughanl . 
20071 : IJaramillo-Gutierrez et all [2009h . 



Transmission of Plasmodium falciparum sexual stages, the gametocytes, to 
the mosquito mainly depends on their maturity and density in the human host at 
the time of the mosquito bite. Even if it has been demonstrated that high game- 
tocyte densities do not guarantee high mosquito infection, a greate r infection of 



mosquitoes is generally observed with higher gametocyte densities fflogh et al 



1998 



2007 



Paul et al 



Drakclcy et al., ll999HTargett et all 120011 IBoudin et all 1200 
Nwakanma et all [2008) . Gametocyte densities vary greatly between hu- 
man hosts, due to host acquired immunity, genetic factors of the parasite strain 
and other environmental parameters (blood quality, fever, anemia, anti-malarial 
drug uptake). In malaria endemic are as, human hosts are typically infected 
with multiple genotypes of parasites toav et all Il992 : Babiker et al. . 19991: 
lAnderson et all 2000t Nwakanma et all 20081) and within- host competition of 
parasite genotypes is likely to drive transmission success. Indeed, from ex- 
periments using Plasmodium animal models, it has been shown that different 
genotypes of parasites in mixed infections have distinct ability to transmit, the 



more viru l ent strain having a comp etitive advantage (|de Roode et al. , l2005t 



Bell et all l2006t IWargo et all 120071 ). If different models have been proposed 



to correlate the gametocyt e density to the transmission success of wild isolates 
of Pl asmodium falciparum ( Pichon et al. . 2000l IBoudin et all 1200,4 IPaul' et all 



20071 ). to date no study related the outcome of infection to parasite complexity 
within the gametocyte population. Understanding relationships between co- 
infecting genotypes and how they influence the disease transmission is however 
of great importance as these might help to predict the spread of resistant strains 
of parasites and guide strategies for malaria control. 

In this paper, we investigate how density and genetic diversity of gameto- 
cytes impact on infectiousness to mosquitoes. We analyze mosquito infection 
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data consisted of oocyst counts with corresponding gametocyte data: densities 
and genotypes at 7 microsatellite loci. Data were obtained from experiments 
of membrane feeding of a local colony of Anopheles gambiae mosquitoes on 
blood from volunteers naturally infected by Plasmodium falciparum isolates 
from Cameroon. Gametocyte genotypes are occurrences of several unordered 
categorical variables, each having numerous levels. Therefore the number of 
variables plus attendant interactions is at least of order of the sample size. We 
considered as response variables: the intensity of infection as measured by the 
mean of oocyst counts in infected mosquitoes, and the infection prevalence de- 
fined by the proportion of mosquitoes that became infected. The high number 
of variables in our data set will obviously lead to over-fitting of many familiar 
regression techniques such as general linear model (GLM). In addition, we deal 
with unordered categorical variables with several leve l s and potentially accom- 
panying interactions. Therefore, following Segal et all ( 200lh . we use regression 
trees techniques. 

We address the problem of selecting the most influent variables related to 
the r esponse variab l e by applying a variable selection procedure, which comes 
from I Genuer et al. (2010), and is based on variable importance from random 
forests (jBreimanl . 120011 ). The resulting method is completely non-parametric 
and thus can be used on data with a large number of variables of various types. 
Moreover, it solves the two following constraints about variable selection: 1) to 
find all variables highly related to the response variable; and 2) to find a small 
number of variables sufficient for a good prediction of the response variable. The 
selected variables are then assessed in a modeling for oocyst count which takes 
into account the complexity of the experiment we deal with. The key point of 
our modeling is the introduction of a new unobserved variable that enables to 
distinguish two possible sources of non infected mosquitoes. Indee d, the het 



eroge neity in the quantity and quality of gametocytes i n blood- meal ( Vaughan 



2007J), and natural variation in mosquito susceptibility (jRiehle et all l2006| ) are 
well known phenomena. We then suggest here that mosquitoes with no oocyst 
can be non infected cither because they did not ingest enough gametocytes with 
the blood-meal, or because they were refractory to the ingested parasites. We 
fitted a Zero-Inflated (ZI) model, which is a two components mixture model 
combining a point mass at zero with a proper count model. Since we deal 
with count data, the typical candidate models were Zero-Inflated Poisson (ZIP) 
and Zero-Inflated Negative Binomial (ZINB); ZINB having a slight advantage 
because it captures over-dispersion which is likely to appear in such data. 

As a result, we found that the gametocyte density and the multiplicity of 
infection were the most influent variables for both infection prevalence and para- 
site intensity. High gametocyte density and low multiplicity of infection resulted 
in high parasite intensity, whereas high infection prevalence came from high ga- 
metocyte density and high multiplicity of infection. 

The rest of the paper is organized as follows. Section [5] presents the data 
to be analyzed in Subsection 12.11 the principle of variable selection based on 
variable importance from random forests in Subsection l2.2l and the modeling of 
oocyst count in Subsection 12.31 Section [3] is devoted to the application of these 
methods on our data. Finally a discussion is given in Section U 
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Figure 1: Alleles detected for the 7 microsatellite loci and their frequencies in 
Plasmodium falciparum gametocyte carriers. 



2 Material and methods 

2.1 Data collection and description 

The data we considered consist of parasite densities and genotypes at 7 mi- 
crosatellite loci for gametocyte isolates of Plasmodium falciparum on one hand, 
and oocyst counts 7 days post feeding for each engorged females on the other 
hand. Plasmodium falciparum gametocyte carriers were identified among asymp- 
tomatic children aged from 5 to 11 in primary schools of the locality of Mfou, 
a small town located 30 km apart from Yaounde, the Cameroon capital city. 
Volunteers were enrolled upon signature of an informed consent form by their 
parents or legal guardian. The protocol was approved by the National Ethics 
Committee of Cameroon. Gametocyte densities were expressed as the num- 
ber of parasites seen against 1 000 leukocytes in a fresh thick blood smear, 
assuming a standard concentration of 8 000 leukocytes per \il (see Table Q] for 
summary of log-transformed gametocyte densities). Venous blood (2 to 3 mL) 
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was taken from consenting gametocyte carriers, centrifuged and the serum re- 
placed by a non-immune AB serum. This procedure avoids the introduction of 
human transmission blocking factors in the experiment. 3 to 5 old females of a 
laboratory strain of Anopheles gambiae mosquito were used for the membrane 
feeding assays placed in cups of approximately 60-80 mosquitoes. Females were 
allowed to feed for 20 minutes through a Parafilm membrane on glass feed- 
ers maintained at 37°C and fully engorged females were kept in insectar until 
dissections 7 days post-infection. Midguts were removed, stained in a 0.4% 
Mercurochrome solution and the number of developed oocysts counted by light 
microscopy (X20 lens). A total of 7 364 mosquitoes (see Table[T|) were dissected, 
giving a mean of 39 females per experiment. 

Gametocytes were separated f rom 1 ml of serum free blood using MACS® 
columns as previously described ( Ribaut et al. . 20081) . DNA extractions from 



purified gametocytes were performed with DNAzol® and 20 ng of gameto- 
cyte DNA were subjected to whole-genome amplification (WGA) using the 
GenomiPhi V2 DNA Amplification Kit to generate sufficient amounts of DNA 
for microsatellite geno typing. Genetic polymorphism was assessed at 7 mi- 



crosatellitc loci as previously described (jAnnan et all 120071 ). Their chromo- 



some location and GcnBank accession number are as follows: POLYa (chr. 4, 
G37809), TA60 (chr. 13, G38876), ARA2 (chr. 11, G37848), Pfg377 (chr. 12, 
G37851), PfPK2 (chr. 12, G37852), TA87 (chr. 6, G38838), and TA109 (chr. 6, 
G38842). Alleles were analyzed using GeneMapper@ software. Multiple alleles 
were scored when minor peaks were at least 20% of the height of the predomi- 
nant allele. The number of observed alleles per locus is 21, 9, 10, 5, 15, 10 and 
17 respectively (see Figure [1]). 



Table 1: Summary of the numbers of mosquitoes per isolate (N) and log- 
transformed of gametocyte densities (log_gameto). 



Min. 


1st Qu. 


Median 


Mean 


3rd Qu. 


Max. 


N 11.000 
log_gameto 1.816 


29.000 
3.156 


38.000 
3.832 


39.380 
3.973 


47.000 
4.612 


79.000 
7.742 



Feedings for which the number of dissected mosquitoes was below 20 were 
not considered. Then 110 experiments were included in the analysis. 

2.2 Variable selection procedure 

The selection procedure we considered is based on variable importances (VI) 
from random forests (RF). The principle of RF is to aggregate regression or 
classification trees built on several bootstrap samples drawn from the learning 
set (more details are given in Appendix IA"]) . It is shown to exhib it very good 
performance for lots of diverse applied situations (jBreimanL l200ll ). Moreover, 
it computes a variable importance index, defined in Appendix [21 Roughly, this 
index is a measure of the degradation of forest predictions when values of a 
variable are permuted. 

RF variable imp ortance is the key point of the selection procedure (see 
Genuer et all ( 20101) for more backgrounds on RF variable importance). This 



procedure presents two main benefits. First the method is completely non- 
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parametric and can be applied on data with lots of variables of various types. 
Second, it achieves two main variable selection objectives: (1) to magnify all 
the variables related to the response variable, even with high redundancy, for 
interpretation purpose; (2) to find a parsimonious set of variables sufficient for 
prediction of the outcome variable. 

Let us now describe the procedure, which comes from lGenuer et al.l d201Cft 



with the following algorithm . The R package randomForest (|Liaw and Wiener 
20021: |R Development) . l2009h was used in all computations. 



To both illustrate and give details about this procedure, we apply it on 
a simulated dataset with n = 200 observations described by 25 continuous 
variables and 25 binary variables. We assume standard normal distribution 
A/"(0, 1) for all continuous variables and binomial distribution £>(0.5) for all 
binary variables. We consider the following linear model 

25 25 

in which only 8 over a total of p — 50 variables are related to the outcome, 
the others being just noise. The set of significant variables is composed by the 
first 4 continuous variables (X Cj ) 1K j <4 and the first 4 binary ones (X^ ) 1< - <4 - 
Their associated coefficients are given by 

(fly)l<*<25 = (A-)l<j<25 = (4,4,2,2,0,...,0). 

We also assume a 0.9 correlation between X Cl and X C21 X C3 and X C4 , X^ and 
Xi, 2 , and Xb 3 and X\ ll . 

The selection process uses a certain number nfor of random forests. In 
addition of this number, the user has also to provide the number ntree of trees in 
each random forest, and the number mtry of variables among which to select the 
best split at each node. The default parameters in the R package randomForest 
we used are mtry = p/3, ntree = 500. In our example, we choose the following 
parameters: mtry = p/3, and we choose nfor = 50 and ntree = 1000 to increase 
the VI stability. The results are summarized in Figure [2] 

Let us detail the main stages of the procedure together with, in italics, the 
results obtained on simulated data. In the following, out of bag (OOB) error 
refers to an estimation of the prediction error (which is defined in Appendix lAl 
and is close to a cross-validation estimate). 

• Elimination step 

First the variables are sorted in descending order according to VI (averaged 
from the nfor runs). 

The result is drawn on the top left graph. The 8 variables of interest arrive 
in the first 8 positions of the ranking. 

Keeping this order in mind, the corresponding standard deviations of VI 
are plotted. A threshold for importance is computed using this graph. 
More precisely, the threshold is set as the minimum prediction value given 
by a Classification And Regr ession Tree (CART) m odel fitting this curve 



(for details about CART, see lBreiman et al.l (|1984|) ). Then only variables 



with an averaged VI exceeding this level are kept. This rule is, in general, 
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Figure 2: Variable selection procedures for interpretation and prediction for 
simulated data 



conservative and leads to retain more variables than necessary, in order to 
make a careful choice later. 

The standard deviations of VI can be found in the top right graph. We can 
see that true variables standard deviation is large compared to the noisy 
variables one, which is very close to zero. The threshold leads to retain 
Pelim — 14 variables. Note that the threshold value is based on VI standard 
deviations (top right panel of Figure^! while the effective thresholding is 
performed on VI mean (top left panel of Figure^). 

• Interpretation step 

Then, OOB error rates (averaged on nfor runs and using default param- 
eters) of the nested random forests models are computed; starting from 
the one with only the most important variable, and ending with the one 
involving all important variables kept previously. The set of variables 
leading to the smallest OOB error is selected. 

Note that in the bottom left graph the error decreases and reaches its min- 
imum when the first Pinterp = 9 variables are included in the model. This 
set of selected variables for interpretation contains the 8 true variables 



INRIA 



Gametocytes infectiousness to mosquitoes using random forests 



9 



plus one noisy one. Note that the associated error is closed to the one 
of the model with the 6 first variables (see bottom left panel of Figure^! 
suggesting that a smaller model should be preferred for prediction purposes. 

• Prediction step 

Finally a sequential variable introduction with testing is performed: a 
variable is added only if the error gain exceeds a data-driven threshold. 
The rationale is that the error decrease must be significantly greater than 
the average variation obtained by adding noisy variables. 

The bottom right graph shows the result of this step, the final model for 
prediction purpose involves 6 out of the 8 true variables. It is of interest 
that each of the two true variables non-selected is correlated to one selected 
variable. The threshold is set to twice the mean of the absolute values of 
the first order differentiated OOB errors between the model with Pinterp — 
9 variables (the model we selected for interpretation, see the bottom left 
graph) and the one with all the p e u m — 14 variables : 



where errOOB(j) is the OOB error of the RF built using the j most 
important variables. 

Since the number of variables after the variable elimination step is small (14), 
we tried some variants more computationally expensive, in order to validate the 
two last steps of the algorithm. Instead of the interpretation step, we launch a 
forward procedure. The principle is, at each time, to seek the best variable (in 
terms of OOB error rate, averaged on nfor runs and using default parameters) 
to add in the current variable set. The set of variables leading to the smallest 
OOB error is then selected. 

For our example, it leads, as the interpretation step, to retain the 8 true 
variables plus one noisy variable (this last noisy variable being different from 
the one selected by interpretation step). We remark however that the initial 
ranking according to VI is quite changed with this procedure. 

To validate the prediction step, we tried an exhaustive procedure, i.e. we 
compute the OOB error rate (averaged on nfor runs and using default parame- 
ters) for all models formed with the variables selected by the forward procedure. 
The set of variables leading to the smallest OOB error is then selected. 

This procedure selects all 9 variables selected previously. 

This validates the interpretation and the prediction step of our algorithm, 
since the variables sets in these variants are close to ours. In addition the errors 
reached by the two procedures are comparable. However this comparison was 
done on the easy simulated dataset we considered in this section. 

2.3 Modeling oocyst count with Zero-Inflated models 

The key point of our modeling is to consider that there are two possible sources 
of non-infected mosquitoes. First, some mosquitoes may not ingest enough para- 
sites with sufficient sex-ratio to ensure fertilization. The reason is seemingly the 
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high heterogeneity in the number of gametocytes in blood-meals (jPichon et al 
2000h . Second, so me other mosquitoes may not be genetically susceptible to the 
parasites ingested ( Riehle et all 120061) . We introduce a new variable U materi 



alizing this situation of non-infected mosquitoes: for mosquito j fed with blood 
coming from gametocytes carrier i, 

jj f 1 if enough parasites are present in its blood-meal 
h3 [ otherwise. 

Ui j is an unobserved variable in our experiment. We assume that for a given 
i, Ui t i,..., Ui tn( are independent and identically distributed. Here rii is the 
number of mosquitoes associated to gametocytes carrier i. For any gametocytes 
carrier i, denote by 

tr := P {U^ = 0) 

the probability that mosquito j does not ingest enough gametocytes in its blood- 
meal. Let Yij be the number of oocysts developed in mosquito j associated to 
gametocytes carrier i. The probability distribution of Yij is given by 

P (Yij = y id ) = 7ril (w< . = o) + (1 - tt,) P (Y itj = y i4 \ U id = 1) , (1) 

where P (Yij — y id \ Uij — 1) is a suitable count probability distribution. 

Consequently, for any gametocytes carrier i, the zero class is a mixture of 
two components with nt and 1 — 7^ as the mixture proportions. The resulting 
model of probability distribution is known as a zero-inflated count model. Such 
a model is a two components mixture model combining a point mass at zero 
with a count distribution s uch as Poisson, geometric or negative binomial (see 
Zeileis and Jackmanl ( 2008 ) and references therein) . Thus there are two sources 



of zeros: zeros may come from point mass or from count component. In our 
framework, the zeros coming from the point mass are assumed to represent 
mosquitoes which did not ingest enough gametocytes to produce an infection. 

Let Xi :— E (Yij\ Uij — 1) be the conditional mean of the count component. 
In the regression setting, both the mean A^ and the excess zero proportion 7Tj 
are related to covariates vectors x.j = (x^i, . . . , Xi tP ) and Zj = (z^x, . . . , z-j.g), 
respectively. The components of these covariates are typically the observations 
of the previously selected variables. They contain gametocyte density and / 
or their genetic profile. We consider canonical link functions log and logit for 
the mean of count component and the point mass component respectively. The 
corresponding regression equations are 

Xi = exp (/Jo + PiXi,i + ■ ■ ■ + PpXi, p ) 
exp (70 + 71A.1 + ... + l v Zi, q ) 



1 + exp (70 + 7l2j,l + . . . + JpZ^q) ' 

where j3 := (/?o, • • • , /3 P ) and 7 := (70, . . . , Jq) are the parameters to be esti- 
mated. Note that different sets of regressors can be specified for the zero inflated 
component and count component. In the simplest case, only an intercept is used 
for modeling the unobserved state (zero vs. count). 

Typical candidate of zero-inflated models for count dat a are zero in f lated 
Poisson (ZIP) and zero-inflated negative binomial (ZINB) fsee lXiang et al 



and references therein) . ZINB and ZIP specifications are given in Appendix [Bj 
For th e estimation of the paramete rs of these models, we used th e package named 
pscl ( Zeileis and Jackman . 20081) in R statistical software ( R Development! . 



2009). 
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3 Application on the real data 
3.1 Variable selection 

Here, the results are given following the main stages of the selection procedure 
given in Subsection 12.21 The details are given once, in the case where the re- 
sponse variable is the infection prevalence of mosquitoes measured by proportion 
of infected mosquitoes. We will just give the selected variables at each stage 
in the other case where the response variable is the mean number of oocysts 
per infected mosquitoes. In these results, the binary variables associated to the 
observed alleles are coded as locus_allele. For example, P f <?377 _093 is allele 
093 at locus Pfgill. In addition to the log-transformed of gametocytes den- 
sity (log-gameto) , we also consider the multiplicity of infection (MOI) defined as 
the maximum number of the observed alleles across the considered microsatellite 
loci. 
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Figure 3: Variable selection for interpretation and prediction. The response 
variable is the infection prevalence measured by the proportion of infected 
mosquitoes. 

Here are the main stages of the procedure. 
• Elimination Step 
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Figure 4: Variable selection for interpretation and prediction. The response 
variable is the mean number in infected mosquitoes. 



— The top left panel in Figure [3] gives the VI mean of all the 88 variables 
sorted in decreasing order. 

— The top right panel of Figure [3] plots the standard deviations of VI 
and the fitted CART model. The threshold mine art represented by 
the horizontal dashed line leads to retain p e Um = 36 variables over 
88. 

• Interpretation Step 

This step is illustrated in the bottom left panel of Figure [3J in which 
the minimum OOB error rate is reached with Pinterp = 11 variables for 
interpretation: 

S in t erp = {log_gameto, P/ 3 377_093, P/PK2S80, MOI, 

P/.g377.102, P/PK2S83, J P/. 9 377_099, TA60_071, 

PfPK2Am, P/PK2J.66, POLYa.135}. 

• Prediction Step 

The bottom right panel in Figure [3] shows the behavior of the OOB error 
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of the nested models corresponding to the selected variables for prediction: 
S pred = {log.gameto, Pfg 377.093, MOI, PfPK2_183}. 

The 4 selected variables in S pre d lead to the OOB error of 0.074. We also launch 
the variant based on forward and exhaustive search of the selection procedure. 
Finally it retains a set of 9 variables containing S pre d- The associated OOB 
error is 0.062 which is not far from 0.074. So we prefer a model with variables 
in Sp re d which is more parsimonious. 

The same procedure was applied when the outcome variable is the infection 
intensity as measured by the mean number of oocysts in infected mosquitoes. 
FigureHlgives the behavior of VI and the OOB error at each stage of the selection 
procedure. 25 variables were selected by thresholding the VI in the first stage, 
the 2 most important being log_gameto and M OI. Even if only log_gameto is 
selected in the interpretation and prediction stages, we also keep MOI . Indeed, 
as can be seen in the bottom left graph of Figure SI the model with these two 
variables is still competitive compared with the model built with log-gameto 
only. 

3.2 Zero-Inflated models fitting oocyst count 

Zero-Inflated negative binomial (ZINB) and Poisson (ZIP) were fitted to the 
data in two situations: (i) using only log-transformed of the gametocyte density 
as variable, (ii) using the set of variables selected for prediction of the infection 
prevalence or the infection intensity (see Subsection 13. ip . The estimates of the 
parameters of ZINB and ZIP models are given in Table [5] and [31 

In situation (i) , it is of interest how the zero counts are captured by the two 
models: they perfectly predict the observed number of non infected mosquitoes 
(see the left panel of Figure [5]). Also, the estimates of the mean number of 
oocysts from both two models are similar (see the right panel of Figure . But 
according to the X 2 goodness-of-fit test (X 2 = 48.162, df — 45, p.value > 
0.3461 for ZINB against X 2 = 2964.606, df = 46, p.value = for ZIP model), 
ZINB model is more adapted to our data. Over-dispersion is probably the main 
reason: there are more mosquitoes with no or few oocysts than the ones with 
high oocyst loads. ZIP model underestimates the number of mosquitoes with 
lower oocyst loads (see the left panel of Figure [5]) . We then consider the ZINB 
model in the rest of the analysis. 

In situation (ii), since the data are over-dispersed, only ZINB is considered. 
The selected variables in the prediction step of our variable selection process 
using the infection prevalence as response variable are used in point mass com- 
ponent, and the ones using the infection intensity as response variable are used 
in the count component. Recall that the infection prevalence is measured by 
the proportion of mosquitoes that became infected, and the infection intensity 
by the mean number of oocysts in infected mosquitoes. It is natural to link 
infection prevalence and infection intensity to zero and count components re- 
spectively. We found that allele PfPK2_183 is the only variable not significant 
(Z = —0.8329, p.value > 0.40). In contrary, gametocyte density log-gameto, 
gametocyte genetic complexity MOI and allele 093 of locus Pfg377 signifi- 
cantly influence the mean oocyst load in mosquitoes in count component. The 
significance of the gametocyte density confirms the result obtained by ZINB 
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model in situation (i). The significance of the effect of MOI in both zero and 
count components is very interesting: it is more important in the zero compo- 
nent (t-test Z = —4.5711, p.value < 4.9e — 06) than in the count one (t-test 
Z = —2.1058, p.value < 3.5e — 02). Also note that the correlation is negative 
in both two components {Pmoi — —0.0333 and 'Jmoi = —0.1499 in count and 
zero components respectively). So mono infected gametocyte isolates increase 
the probability that a mosquito do not ingest enough parasites to ensure the 
transmission success of Plasmodium through its vector mosquito. Hence, low 
values of MOI tend to decrease the infection prevalence. In contrary, a lower 
genetic diversity of gametocytes in an isolate increases the mean number of 
oocysts in the count component. Also note that the presence of allele 093 of 
the genetic marker Pfg377 increases the proportion of non-infected mosquitoes 
(7p/ S 377.093 = 1.2242, SE = 0.1204; t-tcst Z = 10.177, p - value < 2.7e - 24). 



Table 2: Maximum likelihood estimates of the parameters of ZINB and ZIP 
models using only log_gameto as variable for both zero and count components. 
Significant codes: '***'; 0.001 '**'; 0.01 '*'; 0.05 '.'; 0.1 ' '. X 2 Goodness- 
of-fit test: X 2 = 47.0992, df = 45, p.value > 0.3866 for ZINB against X 2 = 
2834.848, df = 46, p.value = for ZIP model 







Estimate 


Std. Error 


z value 


Pr(>|z|) 




ZINB 


Count 


(Intercept) 


-1.3021 


0.1163 


-11.1985 


4.1E-29 






log_gameto 


0.8402 


0.0257 


32.6835 


2.7E-234 






Log(theta) 


-0.5693 


0.0557 


-10.2235 


1.6E-24 




Zero 


(Intercept) 


0.0029 


0.2405 


0.0119 


9.9E-01 






log_gameto 


-0.2618 


0.0531 


-4.9294 


8.2E-07 




ZIP 


Count 


(Intercept) 


-0.7941 


0.0199 


-40.0016 


0.0E+00 






log_gameto 


0.7717 


0.0036 


213.9455 


0.0E+00 




zero 


(Intercept) 


1.4508 


0.1284 


11.2996 


1.3E-29 






log_gameto 


-0.4383 


0.0316 


-13.8930 


7.0E-44 





4 Discussion 

Plasmodium development within its vector mosquito follows complex biological 
processes and factors controlling parasite dynamics are not well understood. In 
the rodent malaria parasite Plasmodium berghei, it has been previously shown 
that the efficiency of parasite transmission from one developmental stage to 
another followed density-dependent models and the best fitted mathematical 



mode l differed from one developmental transition to the other one (|Sinden et al 



20071 ). For natural populations of Plasmodium falciparum, the human malaria 



parasite, modeling becomes more challenging because of unkn own genetic f actors 



and uncontrolled environmental parameters. Nonetheless, iPaul et al.l ([2007Q 
found a sigmoid relationship between Plasmodium falciparum gametocyte den- 
sity and mosquito transmission and the authors argued that parasite aggregation 
within mosquitoes represents an adaptive mechanism for transmission efficiency. 
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Table 3: Maximum likelihood estimates of the parameters of ZINB and 
ZIP models using {log.gameto, P/g377_093, MOI, PfPK2A83} and 
{log-gameto, MOI} as variables in the zero and count components respectively. 
Significant codes: '***'; 0.001 '**'; 0.01 '*'; 0.05 V; 0.1 ' '. 







Estimate 


Std. Error 


z value 


Pr(>]z]) 




ZINB 


Count 


(Intercept) 


-0.9985 


0.1436 


-6.9539 


3.6E-12 






log_gameto 


0.8009 


0.0261 


30.6432 


3.3E-206 






MOI 


-0.0333 


0.0158 


-2.1058 


3.5E-02 


* 




Log(theta) 


-0.5210 


0.0500 


-10.4296 


1.8E-25 




Zero 


(Intercept) 


0.9651 


0.2679 


3.6030 


3.1E-04 






log_gameto 


-0.3769 


0.0534 


-7.0615 


1.6E-12 






Pfg377.093 


1.2242 


0.1204 


10.1717 


2.7E-24 






MOI 


-0.1499 


0.0328 


-4.5711 


4.9E-06 






PfPK2.183 


-4.5225 


5.4301 


-0.8329 


4.0E-01 





Observed and expected frequencies 



Histogram for observed frequencies 

Expected frequencies for ZINB (p.value > 0.(54) 

— Expected frequencies for ZIP (p.value=0.00] 



lb*** 



10 20 30 40 

Number of oocysts 



Empirical and expected means 



Number ot oocysts lor each mosquito 
Empirical mean ot the number ol oocysts 
Expected mean of the number of oocysts by ZINB 
Expected mean of the number ot oocysts by ZIP 




Log gametocytem 



Figure 5: The right panel gives observed and predicted frequencies from ZINB 
and ZIP models, and the right one the empirical and predicted mean number 
of oocysts versus log-gametocytemia. 



The great variability in Plasmodium falciparum oocyst numbers observed in nat- 
ural Anopheles gambiae populations suggests that parasite transmission is the 
result of complex interactions between vectors and parasites, which rely on both 
genetic and environmental factors. Understanding factors that determine trans- 
mission intensity and then parasite population structure is of crucial importance 
in predicting the impact of current malaria control strategies. 

In this study, we analyzed patterns of mosquito infection from experiments 
performed with field isolates of Plasmodium falciparum from Cameroon, an area 
of high malaria endemicity. We considered as response variables: the intensity of 
infection as measured by the mean of oocyst counts in infected mosquitoes, and 
the infection prevalence defined by the proportion of mosquitoes that became in- 
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fected. Gametocyte isolates were genetically characterized at seven microsatel- 
lite loci, allowing estimation of the number of co-infecting parasite clones, the 
MOI, and of the genetic polymorphism, given by the number of alleles at each 
locus. In such a situation with potentially a high number of unordered cat- 
egorical variables with numerous levels and accompanying interactions, many 
familiar statistical techniques such as GLM over-fit the data. Then we had to 
face the problem of selecting the most important variables related to the out- 
come variables. We have addressed this issue with a selection procedure based 
on variable importance from random forests. The procedure has two main ben- 
efits. First, it is completely non-parametric and thus can be used on data with 
lots of variables of various types. Second, it answers the two distinct objectives 
about variable selection: (1) to find all variables related to the outcome variable 
and (2) to find a small number of variables sufficient for a good prediction of 
the outcome variable. 

Recall that we are in a critical situation with the number of variables of 
the order of the sample size (n — 110). The application of the variable se- 
lection procedure on our data revealed that only 4 among the 88 variables we 
considered suffice to predict the infectiousness of Plasmodium falciparum to 
Anopheles gambiae in our experimental settings. The procedure indicates that 
the log-transformed of gametocyte density is the most influent variable and is 
positively correlated for both infection prevalence and infection intensity. This 
probably reflects that Plasmodium parasites have developed complex and diverse 
strategies to ensure their transmission through the mosquito vector. The fact 
that higher oocyst counts are found for higher gametocyte densities conforms 
to previous observations show ing that infectiou sness generally increases with 
gametocytemia. Interestingly, iPaul et al.l (|2007t ) described upper gametocyte 
densities at which mosquito infection rates level off, which is consistent with our 
results. In their models, mosquitoes with no oocyst were treated as non infected 
without further consideration about the putative factors responsible of the non 
infected status. However, a mosquito population fed on the same gametocyte 
carrier results in individuals carrying high number of parasites while others do 
not have any. Failure to infection of a mosquito can result from var ious factors 
such the heterogeneity of gametocyte environment (IVaughanl. I2007T) and natu - 



ral variation in mosquito susceptibility in the other hand ( Riehle et al. . 20061) 



We have described in this article an approach based on that the non-infected 
mosquitoes represent two distinct populations: one genetically refractory vector 
population and another population for which the no-oocyst status results from 
other biological or interacting factors. Further study to quantify the gametocyte 
uptake in mosquitoes fed on a single carrier would help to determine the individ- 
ual variation of gametocyte density between blood-meals, and thus the real part 
of mosquitoes that are refractory and those that did not develop any oocyst be- 
cause of other environmental factors. Nonetheless, our model perfectly predicts 
the number of non infected mosquitoes. Our fitting models revealed that over- 
dispersion of oocysts affects mosquito infection intensity. In addition, a higher 
over-dispersion of oocysts is observed for mosquitoes fed on blood with high 
gametocyte density (over 90 gametocytes/^). The over-dispersed distribution 
of oocysts has often been explained as the result of the aggreg ation of game- 



tocyt es in the capillary blood at the time of the mosquito bite (|Pichon et al 



2000). In this study, mosquitoes were membran e fed and memb rane feeding is 



thought to suppress gametocyte over dispersion (jVaughanl . 120071 ). Nonetheless, 
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the fact that the maximum aggregation is found for high gametocyte densities 
is indicative of aggregation of sexual stages; aggregation may occur within the 
mosquito midgut after parasite intake and genetic factors from the parasites 
may play a role in parasite recognition. This speculation is consistent with the 
hypothesis of adaptive aggregation, where gam ete aggrega t ion would favor fer- 



tiliza tion and then increase infection intensity (jPaul et all l2007t iPichon et al 



2000). However, this increased oocyst burden coincided with a lower infection 
prevalence, possibly indicating that other factors operate in limiting mating (see 
below). 

In malaria endemic areas, intensive use of treatments for malaria has led to 
the emergence of drug-resistant parasites. Despite their low efficacy, malaria 
therapies such as chloroquine (CQ) and sulphadoxine-pyrimcthamine (SP) are 
still widely used in sub Saharan Africa. It has been shown that, upon treatment, 
drug-resistant paras ites have a selective advan tage, leading to higher transmis- 
sion by the vector (jrlallett et all . |2004 l2006h . Our samples originated from 
an area with high drug pressure and volunteers carrying single parasite geno- 
type may have received an early anti malarial treatment that cured them from 
drug-sensitive genotypes, thus allowing an optimal growth and transmission of 
a resistant genotype. However, children who received a malaria treatment in 
the one month period preceding the gametocyte carriage detection were not in- 
cluded in the study and genotyping of pfcrt-K76T mutation in a subset of our 
gametocyte samples identified single infections both as CQ resistant or sensitive 
parasite strains. This result indicates that other factors contribute to the better 
transmission capacity of the mono-infected Plasmodium falciparum isolates. 

We found that the Multiplicity Of Infection is negatively correlated to in- 
fection intensity and positively correlated to infection prevalence (through the 
count and zero components respectively in the ZINB model). This indicates 
that the genetic complexity of gametocyte populations modulates the mosquito 
infection outcomes in an opposite manner: while gametocyte isolates contain- 
ing a single clone of Plasmodium falciparum resulted in a higher mean number 
of oocysts in infected mosquitoes, gametocyte isolates with multiple genotypes 
gave rise to a higher infection prevalence. These results may suggest that malaria 
parasites use kin discrimination to adapt strategies allowing optimal parasite 
transmission. 

Our results showed that the genetic complexity of gametocyte isolates affects 
the mosquito infection intensity. Mosquito infections with isolates of lower com- 
plexity resulted in higher oocyst counts. This may reflects a higher virulence 
of genotypes in these infections, where the gametocyte genotypes in the mono- 
infected isolates could have suppressed their competitors in a prior step of the 
infection, within the human host. Nonetheless, the lower infection prevalence 
in mono clonal infections indicates that the higher number of oocysts arises at 
the cost of a reduced ability to infect the mosquito vector population. This 
could result from blood quality /quantity such as agglutinating antibodies or 
anaemia. It was shown that mixed infections resulte d in increased anaemia , 
a possible adapti ve response for sex ratio adjustment ( Tavlor and Readl . 1998t 



Paul et all 120041 ) . Sex allocation theory predicts that sex r atio becomes less 



female-biased as clone number increases ( Read et al ], Il992t IPaul et all l2002t 



Reece et al. , l2008t ISchall 120091) . Then, if parasite aggregation is an adaptive 
trait to promote gamete fertilization, by contrast the highly female biased sex 
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ratio in mono infected isolates will affect infection prevalence because male avail- 
ability will constitute a limiting factor for mating. 

Our results may have important implications for the genetic structuring of 
Plasmodium falciparum populations. For Plasmodium falciparum, fertilization 
of gametes can occur between genetically-identical gametes (inbreeding) or be- 
tween different gametes (outbreeding). Levels of inbreeding differ from one 
malaria area to anothe r but they roughly correlate with the disease endemicity 
(|Anderson et all |2000). [n areas of high malaria endemicity, inbreeding levels 



are generally more reduced, mostly because parasite genetic diversity is high and 
multiple infections predominant. However, population genetics studies, after 
genotyping of oocysts from wild mosquitoes collected in intense malaria trans- 
mission areas, gave rise to conflictin g results and the extent of inbreeding in nat 
ural s ettings remains controv ersial ( Razakandrainibe et al. . 20051 Annan et al. 



20071 : iMzilahowa et all 120071 ). T he higher fitness of inbred parasites, as sug- 



gested in this study and oth ers ([Hastings and Wedgwood-Qppenheiml . 11997 ; 
Razakandrainibe et al. , 120051) . could explain the departs from panmixia fre- 
quently found in areas of high malaria transmission. 

Finally, our results comfort the idea that malaria parasites are able to dis- 
criminate the genetic complexity of their infections and to adjust accordingly 
adaptive traits implicated in transmission (aggregation, sex ratio). Decipher- 
ing specific processes involved in parasite recognition and competition within 
the mosquito vector would help for our understanding of within host behavior 
of malaria parasites. This may have important implications for future malaria 
interventions strategies. 
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Appendices 

A Random Forests 

RF estimator 

The principle of random forests is to aggregate a given number ntree of 
binary decision trees built on several bootstrap samples drawn from the learning 
set. The bootstrap samples are obtained by uniformly drawing n samples among 
the learning set with repetition. The decision trees are fully developed binary 
trees and the split rule is the following. 

First, the whole dataset (also called the root of the tree) is split into two 
subsets of data (called two children nodes). To do that, one randomly chooses 
a given number mtry of variables, and computes all the splits only for the 
previously selected variables. A split is of the form {X 1 < s} U {X 1 > s}, which 
means that data with the i-th variable value less than the threshold s go to the 
left child node and the others to the right one. Finally the selected split is the 
one minimizing the variance children nodes. 

Then, one restraints to one child node, randomly chooses another set of mtry 
variables and calculates the best split. And so on, until each node is a terminal 
node, i.e. it comprises less than 5 observations. 

A new data item X, starting in the root of the tree, goes down the tree 
following the splits and falls in a terminal node. Then the tree predicts for X, 
Y the mean of response of data in this terminal node. To finally get the RF 
predictor, one aggregates all the tree predictors by averaging their predictions. 

RF error estimate: the OOB error 

Inside the variable selection procedure, we use an estimation of the prediction 
error directly computed by the RF algorithm. This is the Out Of Bag (OOB) 
error and is calculated as follows. Fix one data in the learning sample, and 
consider all the bootstrap samples which do not contain this data (i.e. for 
which the data is "out of bag"). Now perform an aggregation only among trees 
built on these bootstrap samples. After doing this for all data, compare to the 
true response and get an estimation of the prediction error (which is a kind of 
cross- validated error estimate). 

RF variable importance 

Let us now detail the computation of the RF variable importance for the 
first variable X 1 . For each tree, one has a bootstrap sample associated with an 
OOB sample. Predict the OOB data with the tree predictor. Now, randomly 
permute the values of the first variable of the OOB observations, predict these 
modified OOB data with the tree predictor. The variable importance of X 1 is 
defined as the mean increase of prediction errors after permutation. The more 
the error increases, the more important the variable is (note that it can be 
slightly negative, typically for irrelevant variables). 

B ZTV and ZTHB specifications 

These two models are defined by equation (TTJ) with the count model given by: 
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• ZIP: 



P(Y id =y id \ U itj = l) 



cxp(-Aj) -j-j 

E(y^| Uij'= 1) 
Var(F i;j | C4, = l) 



Z1NB: 



P(Y iJ =y iij \ Uij = l) 
Var(iy = 1) 



E(iy t^ = i) 

Aj + - s \ i 



where T (t) = x t x e x dx, and 8 is the over-dispersion parameter. The 
expectation and the variance of Y id are given by: 



Ijl{x) := E(r„|) = (l-TT^A; 
Var(F i;j ) = 
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